Initial projection pushdown optimization #113

Merged · 19 commits · May 26, 2022

Conversation

@jonmmease (Collaborator) commented May 26, 2022

Overview

This PR introduces a framework for identifying the usage of columns within datasets and uses that to add a "projection pushdown" optimization pass to the planner.

Column Usage

A key construct introduced by this PR is the "Column Usage" of a dataset. The column usage of a dataset is either a known set of columns or unknown. This is represented in Rust by the ColumnUsage enum. When a dataset is used in multiple contexts (e.g. multiple encoding channels), the usages from each context are combined with the following union operation (a rough Rust sketch follows the list below):

  • If both usages are "known" then the union is the set union of the known columns.
  • If either usage is unknown, the resulting column usage is also unknown.
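
As a rough illustration only (assuming Rust's standard library HashSet; not necessarily the exact definition from this PR), the enum and its union operation might look like:

    use std::collections::HashSet;

    // Sketch: column usage is either a known set of column names or unknown.
    #[derive(Clone, Debug)]
    enum ColumnUsage {
        Known(HashSet<String>),
        Unknown,
    }

    impl ColumnUsage {
        // Combine usages from two contexts: the result is known only when
        // both inputs are known; otherwise it is unknown.
        fn union(&self, other: &ColumnUsage) -> ColumnUsage {
            match (self, other) {
                (ColumnUsage::Known(a), ColumnUsage::Known(b)) => {
                    ColumnUsage::Known(a.union(b).cloned().collect())
                }
                _ => ColumnUsage::Unknown,
            }
        }
    }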

The column usage must be maintained for every dataset in the specification individually.

Projection pushdown

Here is the outline of the projection pushdown optimization:

  • First, scan the entire specification to identify every column usage, unioning these usages together per dataset.
  • Next, iterate over all datasets in the specification. If a dataset has a known usage, append a Vega project transform to the dataset's transform array that downselects the columns to only those used elsewhere in the specification (a sketch of this pass follows the list). Datasets with unknown column usage are left unchanged.
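
The following is a hedged, self-contained sketch of this pass, built on the ColumnUsage sketch above; the Dataset struct and the use of serde_json for raw Vega transform specs are illustrative stand-ins, not the PR's actual data model:

    use std::collections::HashMap;

    // Illustrative stand-in for a dataset entry in the spec: a name plus
    // an array of raw Vega transform specs.
    struct Dataset {
        name: String,
        transforms: Vec<serde_json::Value>,
    }

    // `usages` holds the per-dataset column usage gathered by scanning the spec.
    fn projection_pushdown(datasets: &mut [Dataset], usages: &HashMap<String, ColumnUsage>) {
        for dataset in datasets.iter_mut() {
            // Only datasets whose full usage is known get a projection appended.
            if let Some(ColumnUsage::Known(columns)) = usages.get(&dataset.name) {
                let mut fields: Vec<&String> = columns.iter().collect();
                fields.sort();
                // Append a Vega "project" transform keeping only the used columns.
                dataset.transforms.push(serde_json::json!({
                    "type": "project",
                    "fields": fields
                }));
            }
        }
    }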

Support and Limitations

Encoding

This PR includes fairly precise determination of column usage within marks. In particular, it correctly identifies the usage of columns in various forms of encoding channels. For example, it will identify the usage of columns "one", "two", "three", and "four" in the following encoding specification:

        {
            "update": {
                "x": {"field": "one", "scale": "scale_a"},
                "y": [
                    {"field": "three", "scale": "scale_a", "test": "datum.two > 7"},
                    {"signal": "datum['four'] * 2"},
                ]
            }
        }
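
For illustration, here is a simplified sketch of collecting field names from such an encoding block (applied to the inner "update" object, assuming serde_json). The analysis in this PR also extracts datum references from "test" and "signal" expressions, which this sketch omits:

    use std::collections::HashSet;
    use serde_json::Value;

    // Collect "field" references from an encoding block such as the
    // "update" object above. Expression parsing for "test"/"signal"
    // (e.g. datum.two, datum['four']) is not shown here.
    fn encoding_fields(encoding: &Value) -> HashSet<String> {
        let mut fields = HashSet::new();
        if let Some(channels) = encoding.as_object() {
            for channel in channels.values() {
                // A channel may hold a single definition or an array of them.
                let defs: Vec<&Value> = match channel {
                    Value::Array(items) => items.iter().collect(),
                    single => vec![single],
                };
                for def in defs {
                    if let Some(field) = def.get("field").and_then(Value::as_str) {
                        fields.insert(field.to_string());
                    }
                }
            }
        }
        fields
    }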

Scales

It will also identify the precise use of columns in scale domains that are computed from a dataset field.
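
For example (an illustrative scale reusing the "scale_a" name from the encoding above; the dataset name "source_0" is made up, and serde_json is assumed), a domain computed from a dataset field contributes that field to the dataset's usage:

    fn main() {
        // Illustrative Vega scale: its domain is computed from the "one"
        // field of dataset "source_0", so "one" is recorded as used.
        let scale = serde_json::json!({
            "name": "scale_a",
            "type": "linear",
            "domain": {"data": "source_0", "field": "one"}
        });
        assert_eq!(scale["domain"]["field"].as_str(), Some("one"));
    }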

Transforms

This PR does not include support for identifying the precise usage of columns within transform pipelines. So if a dataset is used as the "source" of a derived dataset, its column usage will be unknown, and no projection transform will be added.

Most of the infrastructure is in place to add this support in the future.

vlSelectionTest

When selections are used, Vega-Lite generates expressions that use the special vlSelectionTest('store', datum) function. Determining the column usage for this expression is complex because the columns used are determined by the contents of a secondary "store" dataset. If the fields contained in the secondary store dataset are known, the logic in this PR will correctly make use of them. But the PR does not contain any logic to determine the contents of secondary store datasets. Currently, the use of vlSelectionTest will result in unknown column usage.
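
Schematically (an illustrative function built on the ColumnUsage sketch above, not the PR's API), the handling described here amounts to:

    // If the columns held in the secondary store dataset are known, they become
    // the usage of the vlSelectionTest expression; otherwise (the current
    // behavior) the usage is unknown.
    fn vl_selection_test_usage(store_fields: Option<&[String]>) -> ColumnUsage {
        match store_fields {
            Some(fields) => ColumnUsage::Known(fields.iter().cloned().collect()),
            None => ColumnUsage::Unknown,
        }
    }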

@jonmmease changed the title from "[WIP] Initial projection pushdown optimization" to "Initial projection pushdown optimization" on May 26, 2022